---
title: "Sabah RSPO Certification Replication"
author: "Anonymous"
format: docx
---

# Overview

This document contains the full replication code for "Can't teach an old plantation new tricks? Oil palm age, sustainability certification, and deforestation in Sabah, Malaysia." The analysis proceeds in four broad stages, each corresponding to a major section of the manuscript:

1. **Data preparation**: Reading in geospatial data on estate boundaries, RSPO certification dates, and a range of covariates; constructing the longitudinal panel dataset used throughout the analysis.
2. **Proportional hazards modeling**: Estimating the predictors of RSPO certification onset and predictors of estates' first replanting.
3. **Propensity-score matching and difference-in-differences analysis**: Estimating the average causal effect of RSPO certification on forest loss.
4. **Entropy balancing and dose-response estimation**: Estimating the relationship between median oil palm age and deforestation outcomes, separately for certified and uncertified plantations.

All spatial data are processed using the `sf` (Pebesma, 2018; Pebesma & Bivand, 2023) and `terra` (Hijmans, 2024) packages. Zonal statistics - that is, the extraction of raster values within polygon boundaries - are computed using `exactextractr` (Baston, 2023). Matching is implemented using `MatchIt` (Ho et al., 2011), difference-in-differences estimation using `did` (Callaway & Sant'Anna, 2021a), and entropy balancing using `WeightIt` (Greifer, 2025). Marginal predictions for dose-response curves are computed using `marginaleffects` (Arel-Bundock et al., 2024). Balance diagnostics use `cobalt`. All visualizations use `ggplot2` as part of the `tidyverse`.

# Activating libraries

The following code loads all R packages required for the analysis. Brief notes on the role of each package are provided inline.

```{r}

library(tidyverse)      # Core data manipulation and visualization (includes ggplot2, dplyr, tidyr, etc.)
library(sf)             # Simple Features: reading, manipulating, and writing vector geospatial data
library(did)            # Callaway & Sant'Anna (2021) difference-in-differences estimator
library(MatchIt)        # Propensity-score matching (Ho et al., 2011)
library(WeightIt)       # Entropy balancing weights for continuous treatments (Greifer, 2025)
library(marginaleffects) # Marginal predictions and dose-response curves (Arel-Bundock et al., 2024)
library(cobalt)         # Covariate balance diagnostics for matched and weighted samples
library(patchwork)      # Combining multiple ggplot2 figures into composite layouts
library(car)            # Companion to Applied Regression; used here for variance inflation factor (VIF) checks
library(survival)       # Computing proportional hazards models

```

# Setting working directory

All data files are stored locally. The working directory below will need to be updated to match the location of the replication data on a given machine.

```{r}
setwd("")
```

# Set default plotting theme

The custom `theme_rspo` object defines a consistent visual style applied to all figures in the manuscript. It is built on `theme_light()` and sets bold axis labels, titles, and legend text at standardized sizes to ensure legibility in exported figures.

```{r}

theme_rspo <- theme_light() +
  theme(axis.title.x=element_text(size=18, colour="black", face="bold"),
        axis.text.x=element_text(size=12, colour="black", 
                                 face="bold"),
        axis.title.y=element_text(size=18, colour="black",
                                  face="bold"),
        axis.text.y=element_text(size=12, colour="black",
                                 face="bold"),
        plot.title = element_text(size=18, colour="black",
                                  face="bold"),
        strip.text=element_text(size=12, colour="black",
                                face="bold"),
        legend.title=element_text(size=18, colour="black",
                                  face="bold"),
        legend.text=element_text(size=12, colour="black",
                                 face="bold"),
        legend.position = "bottom")

```

# Read in final data
```{r}

plants_model <- readRDS("plants_data.rds")
plants_buff_model <- readRDS("plants_buffer_data.rds")

```

# Descriptive Plots
## Prepare forest summary figure

This section produces a two-panel figure summarizing the evolution of forest cover across Sabah's oil palm estates from 2001 to 2023. The left panel shows time series of key summary statistics (total remaining forest, annual deforestation, cumulative deforestation, etc.) separately for RSPO-certified and uncertified plantations. The right panel shows kernel density distributions of the fraction of each plantation's area that remained under natural forest at three key time points: first observed, first certified, and last observed. The dotted vertical line in the time series marks 2015, the year Sabah's Jurisdictional Approach began.

```{r}

def_summary <- plants_model |>
  st_drop_geometry() |>
  group_by(year, rspo_active) |>
  summarise(forloss_area = sum(forloss_area),
            forloss_peat_area = sum(forloss_peat_area),
            rem_forest = sum(rem_forest),
            rem_peat = sum(rem_peat),
            plant_area = sum(area)) |>
  group_by(rspo_active) |>
  mutate(cum_forloss = cumsum(forloss_area),
         cum_peatloss = cumsum(forloss_peat_area)) |>
  group_by(year) |>
  pivot_longer(cols = 3:8) |>
  mutate(rspo_active = ifelse(rspo_active, "RSPO Certified", "Not Certified"),
         var = case_match(name,
                         "cum_forloss" ~ "Cumulative Deforestation",
                         "cum_peatloss" ~ "Cumulated Peat Deforestation",
                         "forloss_area" ~ "Annual Deforestation",
                         "forloss_peat_area" ~ "Annual Deforested Peat",
                         "rem_forest" ~ "Remaining Forest",
                         "plant_area" ~ "Plantation Area",
                         "rem_peat" ~ "Remaining Forested Peat")) |>
  ggplot(aes(x = year, y = value, linetype = rspo_active, id = rspo_active)) +
    geom_vline(xintercept = 2015, linetype = 3) +
    geom_line(linewidth = 1.5) +
    facet_wrap(~var, ncol = 2, scales = "free_y") +
    labs(x = "",
         y = "Square Kilometers",
         linetype = "") +
    theme_rspo

hist_forest <- plants_model |>
  mutate(period = case_when((year == 2001)|(year == op_start) ~ "First Observed",
                            (year == (rspo_start+1)) ~ "First Certified",
                            (year == 2023) ~ "Last Observed",
                            .default = NA) |>
           factor(levels = c("First Observed", "First Certified", "Last Observed"),
                  ordered=TRUE),
         rspo_active = ifelse(rspo_active, "RSPO Certified", "Not Certified")) |>
  filter(!is.na(period)) |>
  ggplot(aes(x = rem_forest_area_prop_estate)) +
  geom_density(fill = "gray20") +
  scale_x_continuous("Area under Forest (% of Estate)", labels = scales::percent) +
  scale_y_continuous("Density") +
  facet_grid(period ~ rspo_active, scales = "free_y") +
  theme_rspo

png("forest_summary.png", width = 16, height = 8, units="in", res = 150)
def_summary + hist_forest
dev.off()

```

## Plot histogram of plantation age distribution in 2023

This section produces a figure showing the distribution of median oil palm age across Sabah's plantations in 2023, separated by RSPO certification status. The figure illustrates the finding that uncertified plantations tend to have substantially older oil palm stock than certified ones — a central result of the study.

```{r}

age_plot <- plants_model |>
  filter(year == 2023) |>
  mutate(rspo_cond = ifelse(rspo_active, "RSPO-Certified", "Uncertified")) |>
  ggplot(aes(x = age_median)) +
    geom_histogram(fill="black", color="darkgray", linewidth=0.75, bins = 10) +
    facet_wrap(~rspo_cond, ncol = 1) +
    xlab("Median Age of Oil Palm (Years)") +
    ylab("Plantations") +
    ggtitle("Oil Palm Age in Sabah, 2023") +
    theme_rspo

png("oil_palm_ages.png", width = 6, height = 6, units = "in", res = 150)
age_plot
dev.off()

```

## Plot evolution of age and size over time

This section produces a figure showing the Pearson correlation between estate size (used as a proxy for capital availability) and median oil palm age over the observation period. The figure provides context for the interpretation of the oil palm age effects by showing that, while larger plantations tended to have younger stock early in the period (consistent with larger firms having greater replanting capacity), this relationship weakens substantially over time as even large plantations accumulate aging stock.

```{r}

plants_ages <- plants_model |>
  group_by(year) |>
  reframe(area_age_cor = cor(area, age_median)) |>
  ggplot(aes(x = year, y = area_age_cor)) +
  geom_line() +
  labs(title = "Correlation between Estate Size\nand Oil Palm Age over Time",
       x = "Year",
       y = "Correlation (Pearson)")+
  theme_rspo

png("oil_palm_size_age_cor.png", width = 6, height = 6, units = "in", res = 150)
plants_ages
dev.off()
  

```

## Plot evolution of age over time

This section produces boxplots showing the distribution of median oil palm age across estates for each year from 2001 to 2023. The figure documents the divergence in oil palm ages that begins around 2015, as some plantations undertook replanting while others did not, creating increasing heterogeneity in age structures over time.

```{r}

plants_ages_ts <- plants_model |>
  ggplot(aes(x = factor(year), y = age_median)) +
  geom_boxplot() +
  labs(title = "Evolution of Oil Palm Ages, by Estate",
       x = "Year",
       y = "Median Oil Palm Age")+
  theme_rspo +
  theme(axis.text.x = element_text(angle = -45))

png("oil_palm_age_ts.png", width = 10, height = 6, units = "in", res = 150)
plants_ages_ts
dev.off()
  

```

# Proportional hazards model for RSPO certification onset

As a first step in the analysis and to investigate possible selection effects — that is, whether the decision to seek RSPO certification is systematically associated with observable plantation characteristics — we estimate a Cox (1972) proportional hazards model.

In survival analysis, the *hazard* is the instantaneous conditional probability of an event occurring at a given time, given that it has not yet occurred. A proportional hazards model estimates how each covariate is associated with an increase or decrease in this hazard relative to the baseline rate. The resulting coefficient estimates, expressed as hazard ratios, indicate the multiplicative change in certification probability associated with a one-unit (here, one standard deviation, since all covariates are standardized) increase in each predictor.

The dataset for this model is structured so that each observation is a plantation-year, with the sample restricted to the period from 2008 (when certification first became possible) through each plantation's year of first certification (or 2023 for never-certified plantations). The dependent variable is `rspo_active`, which equals 1 in the year of first certification and 0 in all prior years. The model includes a cluster-robust variance estimator (`cluster = po_id`) to account for the fact that multiple observations from the same plantation are not independent.

Before fitting the model, a variance inflation factor (VIF) check is performed to assess multicollinearity among the predictors. VIF values substantially above 5-10 would indicate that two or more predictors are highly correlated and that coefficient estimates may be unstable.

```{r}

plants_cert_model <- plants_model |>
  st_drop_geometry() |>
  ungroup() |>
  filter((year <= rspo_start) &
         (year >= 2008)) |>
  mutate(rspo_active = as.numeric(rspo_active),
         # Standardize all continuous predictors to facilitate interpretation
         # of hazard ratios as effects of one standard deviation increase
         elev = elev |> scale(),
         slope = slope |> scale(),
         rem_forest_area_prop_estate  = rem_forest_area_prop_estate |> scale(),
         peat_prop_area = peat_prop_area |> scale(),
         age_median = age_median |> scale(),
         Time2Port = Time2Port |> scale(),
         population = population |> log() |> scale(),
         area = area |> scale(),
         po_id = paste0("po-", fid))

# Check for multicollinearity using variance inflation factors
cert_check <- lm(rspo_active ~ elev + slope + rem_forest_area_prop_estate + peat_prop_area + 
                   age_median + population + Time2Port + area, data = plants_cert_model)
vif(cert_check)

# Fit Cox proportional hazards model
# Surv(time, event): time = years since 2000, event = certification onset
# id and cluster = plantation ID for repeated-observations adjustment
plants_cert_surv <- coxph(Surv(time=year-2000, event=rspo_active) ~ 
                           elev + slope + rem_forest_area_prop_estate + peat_prop_area + age_median + 
                            population + Time2Port + area,
                         data = plants_cert_model, id = po_id, cluster = po_id)


```

## Plot survival model coefficients

This section produces a coefficient plot showing the hazard ratios and 99% confidence intervals from the proportional hazards model. Coefficients are extracted at the 99% confidence level using `summary(..., conf.int = 0.99)$conf.int`. A vertical reference line at hazard ratio = 1 marks the null hypothesis of no effect. Coefficients to the right of 1 are associated with faster certification onset; those to the left with slower onset.

```{r}

coefs_surv <- summary(plants_cert_surv, conf.int = 0.99)$conf.int |>
  as_tibble() |>
  select(c(1,3,4)) |>
  mutate(term = c("Average Elevation (m)",
                 "Average Slope (Deg.)",
                 "Remaining Forested Area (%)",
                 "Remaining Forested Peatland (%)",
                 "Median Age of Oil Palm",
                 "Average Population Density w/n 2 KM (ln)",
                 "Travel Time to Port (ln)",
                 "Estate Area (sq.km, ln)"))

names(coefs_surv)[1:3] <- c("coef", "lower", "upper")

surv_cfs_plot <- ggplot(coefs_surv, aes(x = coef, y = term)) +
  geom_vline(xintercept = 1, linetype = 2, linewidth = 0.75, color="black") + 
  geom_point(size = 4, color = "black") +
  geom_segment(aes(x = lower, xend = upper, y = term, yend = term),
               linewidth = 1, color = "darkgray") +
  scale_x_continuous("Hazard Ratio for One Standard-Deviation\nIncrease in Independent Variable") +
  scale_y_discrete("") +
  ggtitle("Proportional Hazard Model Coefficients\nDV = RSPO Certification Onset",
          subtitle = paste0("N = ", nrow(plants_cert_model), " Estate-Years")) +
  theme_rspo

surv_cfs_plot

png("surv_coefs.png", width = 9, height = 5, units = "in", res =150)
surv_cfs_plot
dev.off()

```

## Plot the baseline hazard of certification over time

The baseline cumulative certification rate shown in the manuscript is derived from the fitted Cox model using `survfit()`. The baseline hazard represents the expected cumulative probability of certification over time, *conditional on the covariate values in the model* — that is, adjusted for all predictors. The plot shows the estimated cumulative percentage of plantations certified by each year, with a dotted vertical line marking 2015, the year Sabah's Jurisdictional Approach began.

```{r}

haz <- survfit(plants_cert_surv) |>
  tidy() |>
  mutate(rspo_cert_est = (1 - estimate),
         rspo_cert_low = (1 - conf.high),
         rspo_cert_high = (1 - conf.low),
         time = time + 2000) |>
  ggplot(aes(x = time, y = rspo_cert_est)) +
    geom_vline(xintercept = 2015, linetype = 3, linewidth = 0.75) +
    geom_ribbon(aes(ymin = rspo_cert_low, ymax = rspo_cert_high),
                fill = "lightgray", alpha = 0.5) +
    geom_line() +
    scale_x_continuous("", breaks = c(2008, 2012, 2016, 2020)) +
    scale_y_continuous("Expected Baseline Plantations\nRSPO Certified (%)",
                       labels = scales::percent) +
    ggtitle("Expected Cumulative RSPO Certifications in Sabah",
            subtitle = "Estimated conditional on covariates in Cox proportional hazards model in Figure 2") +
  theme_rspo

png("rspo_surv.png", width = 9, height = 5, units = "in", res =150)
haz
dev.off()

```

# Proportional hazards model for time to first replanting

This model estimates the predictors of oil palm replanting on Sabahan plantations, defined as the first year in which a plantation's estimated median oil palm age declines relative to the prior year. Such a decline indicates that a sufficient area has been replanted to shift the plantation-wide weighted median downward.

Most plantations experience at most one replanting event over the 2001-2023 observation period, with a maximum of two. Box-Steffensmeier & De Boef (2006) demonstrate that the conditionalfrailty model - while generally preferred for recurrent events data - produces rare events bias and poorly identified frailty variance estimates when events per subject are this sparse, particularly because the model's event-stratified baseline hazard cannot be reliably estimated from near-zero second-event data. A standard Cox (1972) model for time to first replanting is therefore the more defensible specification. The model is directly analogous to the certification onset model, and coefficients are expressed as hazard ratios for a one-standard-deviation increase in each predictor, facilitating comparison.

```{r}

#Prepare the dataset
plants_replant_first <- plants_model |>
  st_drop_geometry() |>
  ungroup() |>
  # Ensure chronological order within each plantation
  arrange(fid, year) |>
  group_by(fid) |>
  mutate(
    age_median_lag = lag(age_median),
    # Replanting event: 1 if median age declined from prior year, else 0
    replant_event = as.numeric(age_median < age_median_lag)
  ) |>
  # Drop first observation per plantation (no lag available)
  filter(!is.na(age_median_lag)) |>
  # Restrict to rows up to and including the FIRST replanting event.
  # cumsum(replant_event) increments at each event; keeping rows where
  # this cumulative sum is <= 1 retains all pre-event years plus the
  # first event year, then stops — removing the plantation from the risk
  # set after its first event exactly as a first-event Cox model requires.
  filter(cumsum(replant_event) <= 1) |>
  ungroup() |>
  mutate(
    # Counting process time interval: (year-1, year]
    time_start    = year - 1,
    time_stop     = year,
    fid_num       = as.numeric(factor(fid)),
    # Standardize continuous predictors for interpretable hazard ratios
    elev_sc       = scale(elev)           |> as.numeric(),
    slope_sc      = scale(slope)          |> as.numeric(),
    rem_forest_sc = scale(rem_forest_area_prop_estate) |> as.numeric(),
    peat_sc       = scale(peat_prop_area) |> as.numeric(),
    age_lag_sc    = scale(age_median_lag) |> as.numeric(),
    pop_sc        = scale(log(population))|> as.numeric(),
    port_sc       = scale(Time2Port)      |> as.numeric(),
    area_sc       = scale(area)           |> as.numeric(),
    price_sc      = scale(po_price)       |> as.numeric()
  )

#Check for multicoliniarity
replant_vif_check <- lm(
  replant_event ~ elev_sc + slope_sc + rem_forest_sc + peat_sc +
    age_lag_sc + pop_sc + port_sc + area_sc + price_sc + rspo_active,
  data = plants_replant_first
)
vif(replant_vif_check)

#Fit Cox model
replant_cox <- coxph(
  Surv(time_start, time_stop, replant_event) ~
    elev_sc + slope_sc + rem_forest_sc + peat_sc + age_lag_sc +
    pop_sc + port_sc + area_sc + rspo_active,
  data    = plants_replant_first,
  cluster = fid_num
)

summary(replant_cox)

```

## Plot replanting Cox model coefficients

```{r}

coefs_replant <- summary(replant_cox, conf.int = 0.99)$conf.int |>
  as_tibble() |>
  select(c(1, 3, 4)) |>
  mutate(term = c("Average Elevation (m)",
                  "Average Slope (Deg.)",
                  "Remaining Forested Area (%)",
                  "Remaining Forested Peatland (%)",
                  "Median Age of Oil Palm (Prior Year)",
                  "Average Population Density w/n 2 KM (ln)",
                  "Travel Time to Port",
                  "Concession Area (sq. km)",
                  "RSPO Certified"))

names(coefs_replant)[1:3] <- c("coef", "lower", "upper")

replant_coef_plot <- ggplot(coefs_replant, aes(x = coef, y = term)) +
  geom_vline(xintercept = 1, linetype = 2, linewidth = 0.75, color = "black") +
  geom_point(size = 4, color = "black") +
  geom_segment(aes(x = lower, xend = upper, y = term, yend = term),
               linewidth = 1, color = "darkgray") +
  scale_x_continuous(
    "Hazard Ratio for One Standard-Deviation\nIncrease in Independent Variable"
  ) +
  scale_y_discrete("") +
  ggtitle(
    "Proportional Hazard Model Coefficients\nDV = First Replanting",
    subtitle = paste0(
      "N = ", n_distinct(plants_replant_first$fid), " plantations; ",
      sum(plants_replant_first$replant_event), " replanting events observed"
    )
  ) +
  theme_rspo

replant_coef_plot

png("replant_cox_coefs.png", width = 9, height = 5, units = "in", res = 150)
replant_coef_plot
dev.off()

```

# Propensity Score Matching
## Create function to output covariate balance summaries

A key requirement of propensity-score matching is that the matched treatment (RSPO-certified) and control (never certified) groups be balanced on the covariates used in the matching — that is, that the covariate distributions are similar between the two groups after matching. The `output_balance()` function extracts standardized mean differences from a `MatchIt` object using `cobalt::bal.tab()` and writes them to a CSV file for inclusion in the manuscript's appendix tables (Tables A1–A3). Standardized mean differences below 0.25 in absolute value are generally considered indicative of acceptable balance (Ho et al., 2011).

```{r}

output_balance <- function(x, thresholds = 0.2, prefix, varnames){
  temp_bal <- bal.tab(x, thresholds = thresholds)
  
  write.csv(temp_bal$Observations, paste0(prefix, "_obs.csv"), row.names = FALSE)
  
  temp_bal$Balance |>
    filter(Type != "Distance") |>
    select(Diff.Adj, M.Threshold) |>
    mutate(var = varnames,
           Diff.Adj = signif(Diff.Adj, digits = 2)) |>
    write.csv(paste0(prefix, "_balance.csv"), row.names=FALSE)
}

```

## Propensity-score matching for difference-in-differences analysis

The difference-in-differences analysis requires that certified and uncertified plantations be comparable on observable characteristics *before* certification began, so that changes in deforestation rates after certification can be attributed to the certification itself rather than to pre-existing differences. This is achieved through propensity-score matching (Section 4.6; Ho et al., 2011), implemented with the `MatchIt` package.

The propensity score is the estimated probability that a plantation would ever become certified, given its observable characteristics. Matching on the propensity score — rather than on each individual covariate separately — provides a one-dimensional summary of overall similarity. We use nearest-neighbor matching with a caliper of 0.2 standard deviations on the propensity score, which means that a certified plantation can only be matched to an uncertified one if their estimated propensity scores differ by no more than 0.2 standard deviations. This prevents poor-quality matches between plantations that are fundamentally dissimilar.

Matching is performed on the covariate values as of 2007 — the last year before RSPO certification became possible — to capture pre-treatment characteristics. Separate matched samples are constructed for the overall deforestation analysis and for the peatland-specific analysis (restricting the peatland sample to plantations with at least some peatland area). The `output_balance()` function writes the resulting standardized mean differences to CSV for inclusion in Table A1 of the manuscript.

```{r}

# Construct matching dataset: one row per plantation, covariates as of 2007
plants_match_data <- plants_model |>
  st_drop_geometry() |>
  mutate(ever_rspo = rspo_start < 9999) |>
  filter(year == 2007) |>
  mutate(population = log(population))  # Log-transform for normality

# Main matching: all plantations
plants_match_main <- matchit(ever_rspo ~  slope + rem_forest_area_prop_estate + 
                               elev,
                       data = plants_match_data,
                       caliper = 0.2)

output_balance(plants_match_main, prefix = "plants_matched_main",
               varnames = c("Average Slope (Deg.)",
                 "Forested Area (%)",
                 "Average Population Density w/n 2 KM (ln)"))

# Peatland matching: restricted to plantations with peatland present
plants_match_peat_data <- plants_match_data |>
  filter(rem_peat > 0)


plants_match_peat <- matchit(ever_rspo ~  slope + rem_forest_area_prop_estate + 
                               elev,
                       data = plants_match_peat_data,
                       caliper = 0.2)

output_balance(plants_match_peat, prefix = "plants_matched_peat",
               varnames = c("Average Slope (Deg.)",
                 "Forested Area (%)",
                 "Average Population Density w/n 2 KM (ln)"))

```


# Difference-in-differences analysis
## Create function to conduct and output difference-in-differences results

The `do_did()` function wraps the Callaway and Sant'Anna (2021b) difference-in-differences estimator, implemented in the `did` package. It is called repeatedly with different outcome variables and datasets to produce all the DiD results in the manuscript.

Callaway and Sant'Anna's (2021b) approach first estimates *group-time average treatment effects* (ATT(g,t)) — that is, the average treatment effect for the group of plantations first treated in year *g*, measured at calendar time *t*. This approach is preferred over conventional two-way fixed effects (TWFE) regression because TWFE implicitly uses already-treated units as controls for later-treated units, which can produce misleading estimates when treatment effects are heterogeneous over time or across groups (Imai & Kim, 2021). Callaway and Sant'Anna's method avoids this by using only never-treated units as controls (`control = "nevertreated"`).

The function produces two plots for each outcome:

- `plot_temp` (the first list element): The *calendar-time aggregation* — the average ATT aggregated by calendar year. This is the plot shown in the manuscript, summarizing how the average effect of certification on deforestation changes as more plantations become certified over time.
- `plot_event` (the second list element): The *event-study aggregation* — the average ATT plotted by the number of years before or after treatment onset. This format (sometimes called an "event study" or "dynamic DiD" plot) is used in Appendix A3 to visualize pre-treatment trends and assess the parallel trends assumption. A vertical line at event-time 0 marks the treatment onset. Evidence of significant pre-treatment effects (non-zero ATTs in periods labeled as negative) would suggest that the parallel trends assumption is violated.

All models use `alp = 0.01`, corresponding to 99% confidence intervals, consistent with the manuscript's reporting. The `allow_unbalanced_panel = TRUE` option accommodates the fact that not all plantations are observed in every year.

```{r}

do_did <- function(x, yname, gname = "rspo_start", idname = "fid", tname = "year",
                   xformla, est_method = "reg", plot_title, y_pct_scale = TRUE){
  temp <- att_gt(yname = yname, gname = gname, idname = idname,
              tname = tname, xformla = xformla, data = x, est_method = est_method,
              allow_unbalanced_panel = TRUE, control = "nevertreated", alp = 0.01
              )
  
  aggte(temp) |> print()
  
  # Calendar-time aggregation: average ATT by year (used in main results figures)
  plot_temp <- aggte(temp, type = "calendar", na.rm=TRUE) |> 
    ggdid() +
    geom_ribbon(fill = "darkgray", color = "black", linetype = 3, linewidth = 0.5, alpha = 0.5) +
    geom_point(color="black") +
    geom_errorbar(alpha = 0) +
    ggtitle(plot_title) +
    guides(color = "none") +
    theme_rspo
  
  # Event-study aggregation: average ATT by years since treatment
  # (used in appendix pre-trend checks)
  plot_event <- aggte(temp, type = "dynamic", na.rm=TRUE) |> 
    ggdid() +
    geom_vline(xintercept = 0, color = "black", linewidth = 1) +
    geom_ribbon(fill = "darkgray", color = "black", linetype = 3, linewidth = 0.5, alpha = 0.5) +
    geom_point() +
    geom_errorbar(alpha = 0) +
    scale_x_continuous("Years Pre-/Post-Treatment", limits = c(-7, 7)) +
    ggtitle(plot_title) +
    guides(color = "none") +
    theme_rspo
  
  if(y_pct_scale){
    plot_temp <- plot_temp +
      scale_y_continuous(labels=scales::percent)
    
    plot_event <- plot_event +
      scale_y_continuous(labels=scales::percent)
  }  
    
  return(list(plot_temp,
              plot_event))
}

```

## Difference-in-differences estimates — main plantation sample

Using the matched samples constructed above, this section estimates the difference-in-differences models for the main plantation sample. Six outcomes are estimated:

- **`forloss_prop_remain`**: Annual deforestation rate (forest loss as a proportion of remaining forest). This is the primary deforestation rate outcome.
- **`rem_forest_area_prop_estate`**: Remaining forest as a proportion of the total estate.
- **`any_def`**: A binary indicator equal to 1 if any forest loss occurred in a given year. This captures the *probability* of a deforestation event, which may reveal different patterns than the *rate*.
- **`peatloss_prop_remain`**: Annual peatland deforestation rate.
- **`peat_prop_2000`**: Remaining forested peatland as a proportion of the 2000 baseline.
- **`any_peat`**: Binary indicator of any peatland deforestation event.

Results are exported as PNG files. The `[[1]]` outputs (calendar-time aggregations) produce the figures in the main text; the `[[2]]` outputs (event-study aggregations) produce the pre-trend check figures in the Methods Appendix.

```{r}

ids <- match.data(plants_match_main)

plants_match_main_did <- plants_model |>
  filter(fid %in% ids$fid) |>
  mutate(fid_num = as.numeric(fid),
         any_def = as.numeric(forloss_area > 0))

matched_for_did_main <- do_did(x = plants_match_main_did, yname = "forloss_prop_remain",
                           xformla = ~1,
                           plot_title = "Deforestation Rates in RSPO vs. Non-RSPO Estates")

matched_for_any_did_main <- do_did(x = plants_match_main_did, yname = "any_def",
                           xformla = ~1,
                           plot_title = "Deforestation Probability in RSPO vs. Non-RSPO Estates")

ids_peat <- match.data(plants_match_peat)

plants_match_peat_did <- plants_model |>
  filter(fid %in% ids_peat$fid) |>
  mutate(fid_num = as.numeric(fid),
         any_peat = as.numeric(peatloss_prop_remain > 0))

matched_peat_did_main <- do_did(x = plants_match_peat_did, yname = "peatloss_prop_remain",
                           xformla = ~1,
                           plot_title = "Peat Deforestation Rates in RSPO vs. Non-RSPO Estates")

matched_peat_any_did_main <- do_did(x = plants_match_peat_did, yname = "any_peat",
                           xformla = ~1,
                           plot_title = "Peat Deforestation Probability in RSPO vs. Non-RSPO Estates", y_pct_scale = FALSE)

# Calendar-time aggregations
png("matched_dids_main.png", width = 10, height = 12, units = "in", res = 150)
matched_for_did_main[[1]] / matched_for_any_did_main[[1]] / matched_peat_did_main[[1]] / matched_peat_any_did_main[[1]]
dev.off()

# Event-study aggregations (pre-trend checks)
png("matched_dids_main_events.png", width = 10, height = 12, units = "in", res = 150)
matched_for_did_main[[2]] / matched_for_any_did_main[[2]] / matched_peat_did_main[[2]] / matched_peat_any_did_main[[2]] + 
  plot_layout(axes = "collect")
dev.off()


```


## Pre-treatment trend checks — main plantation sample

The parallel trends assumption — which requires that, in the absence of treatment, certified and uncertified plantations would have followed similar deforestation trajectories — cannot be directly tested, since we never observe the counterfactual. However, we can assess whether treated and control plantations were already diverging *before* treatment began, which would be evidence against the assumption.

To implement this check, the sample is restricted to years before 2008 (before RSPO certification was possible), and all certified plantations are assigned a hypothetical "treatment year" of 2003, so that Callaway and Sant'Anna's group-time framework can compute pre-treatment ATTs. Significant pre-treatment ATTs would indicate that the two groups were already on different trajectories before certification, undermining our causal interpretation. These plots are shown in Methods Appendix of the manuscript.

```{r}

test_pre <- plants_model |>
  mutate(ever_rspo = case_when((rspo_start == 9999) ~ 0,
                               .default = 2003)) |>
  ungroup() |>
  filter(year < 2008) |>
  filter(fid %in% ids$fid) |>
  mutate(fid_num = as.numeric(fid),
         any_def = as.numeric(forloss_area > 0))

matched_for_did_main_check <- do_did(x = test_pre, yname = "forloss_prop_remain",
                           xformla = ~1, gname = "ever_rspo",
                           plot_title = "Deforestation Rates in RSPO vs. Non-RSPO Estates")

matched_for_any_did_check <- do_did(x = test_pre, yname = "any_def",
                           xformla = ~1, gname = "ever_rspo",
                           plot_title = "Deforestation Probability in RSPO vs. Non-RSPO Estates")

ids_peat <- match.data(plants_match_peat)

test_pre <- plants_model |>
  mutate(ever_rspo = case_when((rspo_start == 9999) ~ 0,
                               .default = 2003)) |>
  filter(year < 2008) |>
  filter(fid %in% ids$fid) |>
  mutate(fid_num = as.numeric(fid),
         any_peat = as.numeric(peatloss_prop_remain > 0))

matched_peat_did_main_check <- do_did(x = test_pre, yname = "peatloss_prop_remain",
                           xformla = ~1, gname = "ever_rspo",
                           plot_title = "Peat Deforestation Rates in RSPO vs. Non-RSPO Estates")

matched_peat_any_did_main_check <- do_did(x = test_pre, yname = "any_peat",
                           xformla = ~1, gname = "ever_rspo",
                           plot_title = "Peat Deforestation Probability in RSPO vs. Non-RSPO Estates", y_pct_scale = FALSE)

png("matched_dids_main_check.png", width = 10, height = 12, units = "in", res = 150)
matched_for_did_main_check[[1]] / matched_for_any_did_check[[1]] / matched_peat_did_main_check[[1]] / matched_peat_any_did_main_check[[1]]
dev.off()


```

## Propensity-score matching — high-forest-cover subsample

As a robustness check, the difference-in-differences analysis is repeated on a subsample of plantations that retained substantial forest cover as of 2007 — specifically, those in the top quartile of remaining forest area (`rem_forest >= 0.74`). This check addresses the concern that the null main results are mechanically driven by plantations that had already converted all their forest before certification, leaving nothing for certification to protect.

```{r}

plants_match_data_high_for <- plants_model |>
  st_drop_geometry() |>
  mutate(ever_rspo = rspo_start < 9999) |>
  filter(year == 2007) |>
  mutate(population = log(population)) |>
  filter(rem_forest >= 0.74)  # Top quartile of remaining forest cover

plants_match_main_high_for <- matchit(ever_rspo ~  slope + rem_forest_area_prop_estate + 
                               elev,
                       data = plants_match_data_high_for,
                       caliper = 0.2)

output_balance(plants_match_main_high_for, prefix = "plants_matched_main_high_for",
               varnames = c("Average Slope (Deg.)",
                 "Forested Area (%)",
                 "Average Population Density w/n 2 KM (ln)"))

plants_match_peat_data_high_for <- plants_match_data_high_for |>
  filter(rem_peat > 0)


plants_match_peat_high_for <- matchit(ever_rspo ~  slope + rem_forest_area_prop_estate + 
                               elev,
                       data = plants_match_peat_data_high_for,
                       caliper = 0.2)

output_balance(plants_match_peat_high_for, prefix = "plants_matched_peat_high_for",
               varnames = c("Average Slope (Deg.)",
                 "Forested Area (%)",
                 "Average Population Density w/n 2 KM (ln)"))

```

## Difference-in-differences estimates — high-forest-cover subsample

The difference-in-differences and pre-treatment trend checks are repeated for the high-forest-cover subsample. The structure here mirrors the main analysis exactly. Results are reported in the Methods Appendix.

```{r}

ids <- match.data(plants_match_main_high_for)

plants_match_main_did_high_for <- plants_model |>
  filter(fid %in% ids$fid) |>
  mutate(fid_num = as.numeric(fid),
         any_def = as.numeric(forloss_area > 0))

matched_for_did_main_high_for <- do_did(x = plants_match_main_did_high_for, yname = "forloss_prop_remain",
                           xformla = ~1,
                           plot_title = "Deforestation Rates in RSPO vs. Non-RSPO Estates")

matched_for_any_did_main_high_for <- do_did(x = plants_match_main_did_high_for, yname = "any_def",
                           xformla = ~1,
                           plot_title = "Deforestation Probability in RSPO vs. Non-RSPO Estates")

ids_peat <- match.data(plants_match_peat_high_for)

plants_match_peat_did_high_for <- plants_model |>
  filter(fid %in% ids_peat$fid) |>
  mutate(fid_num = as.numeric(fid),
         any_peat = as.numeric(peatloss_prop_remain > 0))

matched_peat_did_main_high_for <- do_did(x = plants_match_peat_did_high_for, yname = "peatloss_prop_remain",
                           xformla = ~1,
                           plot_title = "Peat Deforestation Rates in RSPO vs. Non-RSPO Estates")

matched_peat_any_did_main_high_for <- do_did(x = plants_match_peat_did_high_for, yname = "any_peat",
                           xformla = ~1,
                           plot_title = "Peat Deforestation Probability in RSPO vs. Non-RSPO Estates", y_pct_scale = FALSE)

png("matched_dids_main_high_for.png", width = 10, height = 12, units = "in", res = 150)
matched_for_did_main_high_for[[1]] / matched_for_any_did_main_high_for[[1]] / matched_peat_did_main_high_for[[1]] / matched_peat_any_did_main_high_for[[1]]
dev.off()

# Event-study pre-trend checks for high-forest subsample
png("matched_dids_main_events_high_for.png", width = 10, height = 12, units = "in", res = 150)
matched_for_did_main_high_for[[2]] / matched_for_any_did_main_high_for[[2]] / matched_peat_did_main_high_for[[2]] / matched_peat_any_did_main_high_for[[2]] + 
  plot_layout(axes = "collect")
dev.off()


```


## Pre-treatment trend checks — high-forest-cover subsample

Pre-treatment trend checks for the high-forest-cover subsample, following the same procedure as for the main sample above. These results are shown in the Methods Appendix of the manuscript.

```{r}

test_pre <- plants_model |>
  mutate(ever_rspo = case_when((rspo_start == 9999) ~ 0,
                               .default = 2003)) |>
  ungroup() |>
  filter(year < 2008) |>
  filter(fid %in% ids$fid) |>
  mutate(fid_num = as.numeric(fid),
         any_def = as.numeric(forloss_area > 0))

matched_for_did_main_check_high_for <- do_did(x = test_pre, yname = "forloss_prop_remain",
                           xformla = ~1, gname = "ever_rspo",
                           plot_title = "Deforestation Rates in RSPO vs. Non-RSPO Estates")

matched_for_any_did_check_high_for <- do_did(x = test_pre, yname = "any_def",
                           xformla = ~1, gname = "ever_rspo",
                           plot_title = "Deforestation Probability in RSPO vs. Non-RSPO Estates")

ids_peat <- match.data(plants_match_peat_high_for)

test_pre <- plants_model |>
  mutate(ever_rspo = case_when((rspo_start == 9999) ~ 0,
                               .default = 2003)) |>
  filter(year < 2008) |>
  filter(fid %in% ids$fid) |>
  mutate(fid_num = as.numeric(fid),
         any_peat = as.numeric(peatloss_prop_remain > 0))

matched_peat_did_main_check_high_for <- do_did(x = test_pre, yname = "peatloss_prop_remain",
                           xformla = ~1, gname = "ever_rspo",
                           plot_title = "Peat Deforestation Rates in RSPO vs. Non-RSPO Estates")

matched_peat_any_did_main_check_high_for <- do_did(x = test_pre, yname = "any_peat",
                           xformla = ~1, gname = "ever_rspo",
                           plot_title = "Peat Deforestation Probability in RSPO vs. Non-RSPO Estates", y_pct_scale = FALSE)

png("matched_dids_main_check_high_for.png", width = 10, height = 12, units = "in", res = 150)
matched_for_did_main_check_high_for[[1]] / matched_for_any_did_check_high_for[[1]] / matched_peat_did_main_check_high_for[[1]] / matched_peat_any_did_main_check_high_for[[1]]
dev.off()


```

## Propensity-score matching and difference-in-differences — plantation buffer zones

This section repeats the propensity-score matching and difference-in-differences analysis for the plantation buffer zones, testing whether RSPO certification is associated with changes in deforestation *outside* certified plantations. A significant positive effect in buffers would indicate deforestation leakage — that certification displaces rather than prevents deforestation. Note that the matching here uses the same set of covariates as the main analysis, but applied to the characteristics of the buffer zones themselves (with the exception of RSPO status itself).

Results are shown in the main text of manuscript and in the Methods Appendix.

```{r}

plants_buff_match_data <- plants_buff_model |>
  st_drop_geometry() |>
  mutate(ever_rspo = rspo_start < 9999) |>
  filter(year == 2007) |>
  mutate(population = log(population),
         Time2Port = log(Time2Port),
         area = log(area)) |>
  select(ever_rspo, slope, rem_forest_area_prop_estate, population, area) |>
  na.omit()

plants_buff_match_main <- matchit(ever_rspo ~  slope + rem_forest_area_prop_estate + 
                               population,
                       data = plants_buff_match_data,
                       caliper = 0.2)


output_balance(plants_buff_match_main, prefix = "plants_buff_matched_main",
               varnames = c("Average Slope (Deg.)",
                 "Forested Area (%)",
                 "Average Population Density w/n 2 KM (ln)"))

plants_buff_peat_match_data <- plants_buff_model |>
  st_drop_geometry() |>
  mutate(ever_rspo = rspo_start < 9999) |>
  filter(year == 2007) |>
  mutate(population = log(population),
         Time2Port = log(Time2Port),
         area = log(area)) |>
  select(ever_rspo, slope, rem_forest_area_prop_estate, age_median, population, elev,
         rem_peat) |>
  na.omit() |>
  filter(rem_peat > 0)


plants_buff_match_peat <- matchit(ever_rspo ~  slope + rem_forest_area_prop_estate + age_median +
                               population + elev,
                       data = plants_buff_peat_match_data,
                       caliper = 0.2)

output_balance(plants_buff_match_main, prefix = "plants_buff_matched_peat",
               varnames = c("Average Slope (Deg.)",
                 "Forested Area (%)",
                 "Average Population Density w/n 2 KM (ln)"))

```

## Difference-in-differences estimates — plantation buffer zones

```{r}

ids <- match.data(plants_buff_match_main)

plants_buff_match_main_data <- plants_buff_model |>
  filter(fid %in% ids$fid) |>
  mutate(any_def = as.numeric(forloss_area > 0),
         # Log-odds transformations are computed here but not used in the
         # final analysis; retained for reference
         forloss_prop_remain_logodds = 
           log((forloss_prop_remain+0.0001)/(1-forloss_prop_remain+0.0001)),
         any_peat = as.numeric(peatloss_prop_remain > 0),
         peat_prop_2000_logodds = 
           log((peat_prop_2000+0.0001)/(1-peat_prop_2000+0.0001)),
         peatloss_prop_remain_logodds = 
           log((peatloss_prop_remain+0.0001)/(1-peatloss_prop_remain+0.0001)),
         fid_num = as.numeric(fid))

matched_for_buff_did_main <- do_did(x = plants_buff_match_main_data, yname = "forloss_prop_remain",
                           xformla = ~1,
                           plot_title = "Deforestation Rates in RSPO vs. Non-RSPO Buffer Zones")

matched_for_buff_any_did_main <- do_did(x = plants_buff_match_main_data, yname = "any_def",
                           xformla = ~1,
                           plot_title = "Deforestation Probability in RSPO vs. Non-RSPO Buffer Zones")

ids_peat <- match.data(plants_buff_match_peat)

plants_buff_match_peat_data <- plants_buff_match_main_data |>
  filter(fid %in% ids_peat$fid) |>
  mutate(fid_num = as.numeric(fid))

matched_buff_did_peat <- do_did(x = plants_buff_match_peat_data,
                                     yname = "peatloss_prop_remain", xformla = ~1,
                           plot_title = "Peat Deforestation Rates in RSPO vs. Non-RSPO Buffer Zones", y_pct_scale = FALSE)

matched_any_buff_did_peat <- do_did(x = plants_buff_match_peat_data,
                                     yname = "any_peat", xformla = ~1,
                           plot_title = "Peat Deforestation Probability in RSPO vs. Non-RSPO Buffer Zones", y_pct_scale = FALSE)

# Calendar-time aggregations for buffer zones
png("matched_dids_buff_main.png", width = 10, height = 12, units = "in", res = 150)
matched_for_buff_did_main[[1]] / matched_for_buff_any_did_main[[1]] / matched_buff_did_peat[[1]] / matched_any_buff_did_peat[[1]] 
dev.off()

# Event-study pre-trend checks for buffer zones
png("matched_dids_buff_main_events.png", width = 10, height = 12, units = "in", res = 150)
matched_for_buff_did_main[[2]] / matched_for_buff_any_did_main[[2]] / matched_buff_did_peat[[2]] / matched_any_buff_did_peat[[2]] + plot_layout(axes = "collect")
dev.off()
```

# Entropy balancing for continuous treatment: oil palm age and deforestation

The difference-in-differences analysis above treats RSPO certification as a binary treatment — a plantation is either certified or not. However, the study's second research objective is to understand how the *age* of a plantation's oil palm stock — a continuous variable — affects both the likelihood of certification and the intensity of deforestation.

Entropy balancing, originally developed by Hainmueller (2017) for binary treatments, addresses this problem by finding a set of observation weights that renders the treatment variable statistically independent of a specified set of covariates in the reweighted sample. For binary treatments, this means equating the covariate means (and optionally variances and skewness) between treatment and control groups. Tübbicke (2022) and Vegetabile et al. (2021) independently extended this approach to continuous treatments: rather than balancing covariate means between groups, the extended method finds weights that drive the Pearson correlation between each covariate and the continuous treatment variable to zero across the full range of the treatment's distribution. This produces a reweighted sample in which differences in deforestation outcomes across values of median oil palm age cannot be attributed to differences in the observed covariates, allowing us to estimate a *dose-response curve* — the relationship between oil palm age and each deforestation outcome, holding other factors constant.

The method is implemented using the `WeightIt` package (Greifer, 2025), with `method = "ebal"` and `moments = 3`, which enforces balance on the first three moments (mean, variance, and skewness) of each covariate's relationship with median oil palm age. All covariates are standardized prior to estimation. Four sets of weights are estimated:

1. **`out_weights_all`**: All plantations, pooled across RSPO certification status. Used to estimate the overall dose-response relationship between oil palm age and deforestation.
2. **`out_weights_peat_all`**: Plantations with non-trivial peatland coverage (> 0.1% of area), pooled. Used for the peatland-specific dose-response analysis.
3. **`out_weights_by_rspo`**: All plantations, with weights estimated separately within each certification status group (`by = ~rspo_active`). This is the key model, as it allows us to compare dose-response curves between certified and uncertified plantations while ensuring that the comparison within each group is not confounded by the observed covariates. Note that `rspo_active` is excluded from the covariate list in this model, since it is the grouping variable.
4. **`out_weights_peat_by_rspo`**: Peatland-only plantations, split by certification status.

```{r}

# Construct the modeling dataset for entropy balancing
plants_convalue <- plants_model  |>
  ungroup() |>
  select(year, fid, age_median, elev, slope, rem_forest_area_prop_estate, peat_prop_area, population, Time2Port,
         area, meiv, po_price, rspo_active, forloss_prop_remain,
         peatloss_prop_remain) |>
  na.omit() |>
  mutate(elev = scale(elev),
         slope = scale(slope),
         rem_forest_area_prop_estate = scale(rem_forest_area_prop_estate),
         peat_prop_area_sc = scale(peat_prop_area),
         population = scale(population),
         Time2Port = scale(Time2Port),
         area = scale(area),
         meiv_sc = scale(meiv),
         po_price_sc = scale(po_price),
         # Binary deforestation indicators used as additional DVs
         any_def = as.numeric(forloss_prop_remain > 0),
         any_peat = as.numeric(peatloss_prop_remain > 0))

# Estimate entropy balancing weights for each of the four modeling scenarios
out_weights_all <- weightit(age_median ~ elev + slope + rem_forest_area_prop_estate + peat_prop_area_sc + 
                            population + Time2Port + area + meiv_sc + po_price_sc + rspo_active,
                        data = plants_convalue, method = "ebal", moments = 3)
bal.tab(out_weights_all, un = TRUE)

png("bal_cor_all.png", width = 8, height = 6, units = "in", res = 150)
love.plot(out_weights_all, stats = "correlations", un = TRUE,
          thresholds = c(cor = 0.1))
dev.off()

plants_convalue_peat <- plants_convalue[plants_convalue$peat_prop_area > 0.001,]

out_weights_peat_all <- weightit(age_median ~ elev + slope + rem_forest_area_prop_estate + peat_prop_area_sc + 
                            population + Time2Port + area + meiv_sc + po_price_sc + rspo_active,
                        data = plants_convalue_peat,
                        method = "ebal", moments = 3)
bal.tab(out_weights_peat_all)

# Split by RSPO status (rspo_active excluded from covariate list as it is the grouping variable)
out_weights_by_rspo <- weightit(age_median ~ elev + slope + rem_forest_area_prop_estate + peat_prop_area_sc + 
                            population + Time2Port + area + meiv_sc + po_price_sc,
                        data = plants_convalue, method = "ebal", moments = 3, by = ~rspo_active)
bal.tab(out_weights_by_rspo)

out_weights_peat_by_rspo <- weightit(age_median ~ elev + slope + rem_forest_area_prop_estate + peat_prop_area_sc + 
                            population + Time2Port + area + meiv_sc + po_price_sc,
                        data = plants_convalue_peat, by = ~rspo_active,
                        method = "ebal", moments = 3)
bal.tab(out_weights_peat_by_rspo)

```

# Entropy balance diagnostics - Love plots

The following function produces custom balance diagnostic figures that show the Pearson correlations between median oil palm age and each covariate before and after weighting. Successful entropy balancing should drive all weighted correlations close to zero, confirming that the dose-response estimates are not confounded by the observed covariates. Dashed reference lines at ±0.1 mark a conventional threshold for acceptable balance (following Tübbicke, 2022).

The function also exports the balance statistics to CSV for reference, and returns the plot object for further customization or combination using `patchwork`.

## Create function

```{r}

# Function to plot entropy balance diagnostics for continuous treatment
# Works with WeightIt objects using method = "ebal"
# For split models (by = ~rspo_active), set split = TRUE

output_ebal_balance <- function(weights_obj, 
                                 prefix, 
                                 varnames,
                                 split = FALSE,
                                 split_labels = c("Not RSPO Certified", "RSPO Certified"),
                                 threshold = 0.1,
                                 title = "") {
    
    bt <- bal.tab(weights_obj, stats = "correlations", un = TRUE)
    
    plot_data <- bt$Balance |>
      rownames_to_column("variable") |>
      filter(variable != "prop.score") |>
      mutate(var = varnames) |>
      select(var, Corr.Un, Corr.Adj) |>
      pivot_longer(cols = c(Corr.Un, Corr.Adj),
                   names_to = "weighted",
                   values_to = "correlation") |>
      mutate(weighted = case_match(weighted,
                                   "Corr.Un"  ~ "Unweighted",
                                   "Corr.Adj" ~ "Weighted"),
             weighted = factor(weighted, levels = c("Unweighted", "Weighted")),
             var = factor(var, levels = rev(varnames)))
    
    p <- ggplot(plot_data, aes(x = correlation, y = var, 
                                shape = weighted, color = weighted)) +
      geom_vline(xintercept = 0, color = "gray50", linewidth = 1) +
      geom_vline(xintercept = c(-threshold, threshold), 
                 color = "gray70", linewidth = 0.5, linetype = "dashed") +
      geom_point(size = 4, stroke = 1.2) +
      scale_shape_manual(values = c("Unweighted" = 1, "Weighted" = 16)) +
      scale_color_manual(values = c("Unweighted" = "gray50", "Weighted" = "black")) +
      labs(x = "Pearson Correlation with Median Oil Palm Age",
           y = "",
           title = title,
           shape = "",
           color = "") +
      theme_rspo +
      theme(legend.position = "bottom")
    
    # Export balance statistics to CSV
    plot_data |>
      pivot_wider(names_from = weighted, values_from = correlation) |>
      write.csv(paste0(prefix, "_ebal_balance.csv"), row.names = FALSE)
  
  return(p)
  
}

```

## Execute function

```{r}

# Variable names must match the order of covariates in the weightit() formula
ebal_varnames <- c("Average Elevation (m)",
                   "Average Slope (Deg.)",
                   "Remaining Forest (%)",
                   "Peatland Proportion",
                   "Population Density w/n 2 KM (ln)",
                   "Travel Time to Port",
                   "Plantation Area",
                   "ENSO Index",
                   "Palm Oil Price",
                   "RSPO Certification Status")

# For split models, rspo_active is the grouping variable and is omitted
ebal_varnames_nosplit <- ebal_varnames[1:9]

p_ebal_all <- output_ebal_balance(out_weights_all, 
                                   prefix = "ebal_all",
                                   varnames = ebal_varnames,
                                  title="Correlations with Median Oil Palm Age,\nAll Plantations")

p_ebal_peat <- output_ebal_balance(out_weights_peat_all,
                                    prefix = "ebal_peat",
                                    varnames = ebal_varnames,
                                  title="Correlations with Median Oil Palm Age,\nPlantations with Peatlands")

p_ebal_by_rspo <- output_ebal_balance(out_weights_by_rspo,
                                       prefix = "ebal_by_rspo",
                                       varnames = ebal_varnames_nosplit,
                                  title="Correlations with Median Oil Palm Age,\nAll Plantations,\nBalanced within RSPO Status")

p_ebal_peat_by_rspo <- output_ebal_balance(out_weights_peat_by_rspo,
                                            prefix = "ebal_peat_by_rspo",
                                            varnames = ebal_varnames_nosplit,
                                  title="Correlations with Median Oil Palm Age,\nPlantations with Peatlands,\nBalanced within RSPO Status")

# Export individual balance figures
png("ebal_balance_all.png", width = 12, height = 7, units = "in", res = 150)
p_ebal_all
dev.off()

png("ebal_balance_peat.png", width = 12, height = 7, units = "in", res = 150)
p_ebal_peat
dev.off()

png("ebal_balance_by_rspo.png", width = 12, height = 7, units = "in", res = 150)
p_ebal_by_rspo
dev.off()

png("ebal_balance_peat_by_rspo.png", width = 12, height = 7, units = "in", res = 150)
p_ebal_peat_by_rspo
dev.off()

# Combined balance figure for certified and uncertified splits
png("ebal_balance_all_by_rspo.png", width = 12, height = 10, units = "in", res = 150)
p_ebal_by_rspo/p_ebal_peat_by_rspo
dev.off()

```

# Dose-response curve estimation

Now we can estimate and plot the dose-response curves in Section 5.3 of the manuscript. These curves show the predicted value of each deforestation outcome as a function of median oil palm age, after adjusting for observed covariates through entropy balancing.

The estimation proceeds in two steps. First, the `dose_resp()` function fits a weighted regression of each outcome on a natural cubic spline of `age_median` (with 3 degrees of freedom), using the entropy balancing weights to adjust for confounding. Natural cubic splines are used because the relationship between oil palm age and deforestation is expected to be nonlinear — younger plantations may show different patterns than older ones — and splines allow this flexibility without imposing a specific functional form. For continuous outcomes (`forloss_prop_remain`, `peatloss_prop_remain`), a linear model (`lm_weightit`) is used; for binary outcomes (`any_def`, `any_peat`), a logistic regression (`glm_weightit`) is used to ensure predictions remain bounded between 0 and 1. Standard errors are clustered at the plantation level (`cluster = dat$fid`) to account for the dependence between multiple years of observations from the same plantation.

Second, `avg_predictions()` from `marginaleffects` (Arel-Bundock et al., 2024) evaluates the fitted model at each integer value of median oil palm age from 0 to 25 years, the 90th percentile for observed oil palm age, yielding the predicted average outcome at each age value with 99% confidence intervals.

When `separate = TRUE`, the regression includes an interaction between the spline of `age_median` and `rspo_active`, and predictions are computed separately for certified and uncertified plantations. This allows us to compare how the age-deforestation relationship differs between certified and uncertified plantations.

```{r}

#CREATE FUNCTION TO ESTIMATE DOSE-RESPONSE CURVES
dose_resp <- function(DV, dat, weights, dose_range, binary=FALSE, separate = separate){
  
  if(separate){
    # Interaction model: spline of age x RSPO status, for split dose-response curves
    form <- paste0(DV, " ~ splines::ns(age_median, df = 3)*rspo_active") |> as.formula()
  }else{
    # Pooled model: spline of age only
   form <- paste0(DV, " ~ splines::ns(age_median, df = 3)") |> as.formula() 
  }
  
  if(!binary){
    # Linear weighted regression for continuous DVs (deforestation rate)
    est <- lm_weightit(form, data = dat,
                    vcov = "asympt", weightit = weights,
                    cluster = dat$fid)
    
  }else{
    # Logistic weighted regression for binary DVs (deforestation probability)
    est <- glm_weightit(form, data = dat, family = binomial(link="logit"),
                    vcov = "asympt", weightit = weights, type = "probs",
                    cluster = dat$fid)
  }
  
  if(separate){
    # Compute average predicted outcome at each age value, separately by RSPO status
    dose_preds <- avg_predictions(est, conf_level = 0.99,
                     variables = list(age_median = dose_range,
                                      rspo_active = c(FALSE,TRUE))) |>
    tidy() |>
    mutate(DV = DV)
  }else{
   dose_preds <- avg_predictions(est, conf_level = 0.99,
                     variables = list(age_median = dose_range)) |>
    tidy() |>
    mutate(DV = DV) 
  }
  
  return(dose_preds)
  
}

#CREATE FUNCTION TO PLOT DOSE-RESPONSE CURVES
plot_dose_resp <- function(DV, dat, weights, dose_range, binary=FALSE, gname, y_pct_scale = TRUE,
                           separate = FALSE){
    
    if(separate){
      dose_plot <- dose_resp(DV = DV, dat = dat, weights = weights,
                  dose_range = dose_range, binary = binary, separate = separate) |>
                  ggplot(aes(x = age_median)) +
                    geom_line(aes(y = estimate)) +
                    geom_ribbon(aes(ymin = conf.low, ymax = conf.high),
                      alpha = .3) +
                    labs(x = "Median Oil Palm Age", y = "",
                         title = gname) +
                    facet_wrap(~rspo_active, ncol = 2,
                               labeller = as_labeller(c("FALSE"="Not RSPO Certified",
                                      "TRUE"="RSPO Certified"))) +
                    theme_rspo
    }else{
      dose_plot <- dose_resp(DV = DV, dat = dat, weights = weights,
                  dose_range = dose_range, binary = binary, separate = separate) |>
                  ggplot(aes(x = age_median)) +
                    geom_line(aes(y = estimate)) +
                    geom_ribbon(aes(ymin = conf.low, ymax = conf.high),
                      alpha = .3) +
                    labs(x = "Median Oil Palm Age", y = "",
                         title = gname) +
                    theme_rspo
    }
  
  if(y_pct_scale){
    dose_plot <- dose_plot +
      scale_y_continuous(labels=scales::percent)
  }
  
  return(dose_plot)
}

```

## Compute Dose-Response Curves by Oil Palm Age

The dose-response curves are estimated for all four deforestation outcomes, both pooled across RSPO status and split by RSPO status.

```{r}

# Pooled dose-response curves (all plantations)
dose_resp_def_all <- plot_dose_resp(DV = "forloss_prop_remain", dat = plants_convalue, weights = out_weights_all,
                  dose_range = 1:25, binary = FALSE,
                  gname = "Deforestation Rate")

dose_resp_any_def_all <- plot_dose_resp(DV = "any_def", dat = plants_convalue, weights = out_weights_all,
                  dose_range = 1:25, binary = TRUE,
                  gname = "Deforestation Probability")

dose_resp_peat_all <- plot_dose_resp(DV = "peatloss_prop_remain", dat = plants_convalue_peat,
                                     weights = out_weights_peat_all,
                  dose_range = 1:25, binary = FALSE,
                  gname = "Peat Deforestation Rate")

dose_resp_any_peat_all <- plot_dose_resp(DV = "any_peat", dat = plants_convalue_peat,
                                     weights = out_weights_peat_all,
                  dose_range = 1:25, binary = FALSE,
                  gname = "Peat Deforestation Probability")

# Split by RSPO status
dose_resp_def_sep <- plot_dose_resp(DV = "forloss_prop_remain", dat = plants_convalue, weights = out_weights_all,
                  dose_range = 1:25, binary = FALSE,
                  gname = "Deforestation Rate", separate = TRUE)

dose_resp_any_def_sep <- plot_dose_resp(DV = "any_def", dat = plants_convalue, weights = out_weights_all,
                  dose_range = 1:25, binary = TRUE,
                  gname = "Deforestation Probability", separate = TRUE)

dose_resp_peat_sep <- plot_dose_resp(DV = "peatloss_prop_remain", dat = plants_convalue_peat,
                                     weights = out_weights_peat_all,
                  dose_range = 1:25, binary = FALSE,
                  gname = "Peat Deforestation Rate", separate = TRUE)

dose_resp_any_peat_sep <- plot_dose_resp(DV = "any_peat", dat = plants_convalue_peat,
                                     weights = out_weights_peat_all,
                  dose_range = 1:25, binary = TRUE,
                  gname = "Peat Deforestation Probability", separate = TRUE)

# Export pooled dose-response figure
png("dose_resp_all.png", width = 10, height = 12, units = "in", res = 150)
dose_resp_def_all / dose_resp_any_def_all / dose_resp_peat_all / dose_resp_any_peat_all + plot_layout(axes = "collect")
dev.off()

# Split dose-response curves by RSPO certification status
png("dose_resp_sep.png", width = 10, height = 12, units = "in", res = 150)
dose_resp_def_sep / dose_resp_any_def_sep / dose_resp_peat_sep / dose_resp_any_peat_sep + plot_layout(axes = "collect")
dev.off()

```

## Entropy balancing and dose-response analysis by plantation buffer zones

This section extends the entropy balancing analysis to the plantation buffer zones — areas within 2 kilometers of a certified plantation that are not within 2 kilometers of any other plantation. The goal is to assess whether the relationship between oil palm age and deforestation in these near-plantation areas differs between zones adjacent to RSPO-certified plantations and those adjacent to uncertified plantations. Because the buffer zone dataset uses plantation fid as its identifier, standard errors are clustered at the plantation level to account for temporal dependence within units. The logic is exactly analogous to the proceeding.


```{r}


# Prepare buffer zone dataset for entropy balancing
plants_buff_convalue <- plants_buff_model |>
  st_drop_geometry() |>
  ungroup() |>
  select(year, fid, age_median, elev, slope, rem_forest_area_prop, peat_prop_area,
         population, Time2Port, area, meiv, po_price, rspo_active,
         forloss_prop_remain, peatloss_prop_remain) |>
  na.omit() |>
  mutate(elev = scale(elev),
         slope = scale(slope),
         rem_forest_area_prop = scale(rem_forest_area_prop),
         peat_prop_area_sc = scale(peat_prop_area),
         population = scale(population),
         Time2Port = scale(Time2Port),
         area = scale(area),
         meiv = scale(meiv),
         po_price = scale(po_price),
         any_def = as.numeric(forloss_prop_remain > 0),
         any_peat = as.numeric(peatloss_prop_remain > 0))

# Restrict peatland subsample to buffer zones with non-trivial peatland coverage
plants_buff_convalue_peat <- plants_buff_convalue[plants_buff_convalue$peat_prop_area > 0.001,]

#Estimate entropy balancing weights
out_weights_buff_all <- weightit(age_median ~ elev + slope + rem_forest_area_prop +
                                   peat_prop_area_sc + population + Time2Port +
                                   area + meiv + po_price + rspo_active,
                                 data = plants_buff_convalue,
                                 method = "ebal", moments = 3)
bal.tab(out_weights_buff_all, un = TRUE)

out_weights_buff_peat_all <- weightit(age_median ~ elev + slope + rem_forest_area_prop +
                                        peat_prop_area_sc + population + Time2Port +
                                        area + meiv + po_price + rspo_active,
                                      data = plants_buff_convalue_peat,
                                      method = "ebal", moments = 3)
bal.tab(out_weights_buff_peat_all)

# Split by adjacent plantation's RSPO certification status
# (rspo_active omitted from covariates as it is the grouping variable)
out_weights_buff_by_rspo <- weightit(age_median ~ elev + slope + rem_forest_area_prop +
                                       peat_prop_area_sc + population + Time2Port +
                                       area + meiv + po_price,
                                     data = plants_buff_convalue,
                                     method = "ebal", moments = 3,
                                     by = ~rspo_active)
bal.tab(out_weights_buff_by_rspo)

out_weights_buff_peat_by_rspo <- weightit(age_median ~ elev + slope + rem_forest_area_prop +
                                            peat_prop_area_sc + population + Time2Port +
                                            area + meiv + po_price,
                                          data = plants_buff_convalue_peat,
                                          by = ~rspo_active,
                                          method = "ebal", moments = 3)
bal.tab(out_weights_buff_peat_by_rspo)

# Balance diagnostics
p_ebal_buff_all <- output_ebal_balance(out_weights_buff_all,
                                       prefix = "ebal_buff_all",
                                       varnames = ebal_varnames,
                                       title = "Correlations with Median Oil Palm Age,\nAll Buffer Zones")

p_ebal_buff_peat <- output_ebal_balance(out_weights_buff_peat_all,
                                        prefix = "ebal_buff_peat",
                                        varnames = ebal_varnames,
                                        title = "Correlations with Median Oil Palm Age,\nBuffer Zones with Peatlands")

p_ebal_buff_by_rspo <- output_ebal_balance(out_weights_buff_by_rspo,
                                           prefix = "ebal_buff_by_rspo",
                                           varnames = ebal_varnames_nosplit,
                                           title = "Correlations with Median Oil Palm Age,\nAll Buffer Zones,\nBalanced within RSPO Status")

p_ebal_buff_peat_by_rspo <- output_ebal_balance(out_weights_buff_peat_by_rspo,
                                                prefix = "ebal_buff_peat_by_rspo",
                                                varnames = ebal_varnames_nosplit,
                                                title = "Correlations with Median Oil Palm Age,\nBuffer Zones with Peatlands,\nBalanced within RSPO Status")

png("ebal_balance_buff_all.png", width = 12, height = 7, units = "in", res = 150)
p_ebal_buff_all
dev.off()

png("ebal_balance_buff_peat.png", width = 12, height = 7, units = "in", res = 150)
p_ebal_buff_peat
dev.off()

png("ebal_balance_buff_by_rspo.png", width = 12, height = 7, units = "in", res = 150)
p_ebal_buff_by_rspo
dev.off()

png("ebal_balance_buff_peat_by_rspo.png", width = 12, height = 7, units = "in", res = 150)
p_ebal_buff_peat_by_rspo
dev.off()

png("ebal_balance_buff_all_by_rspo.png", width = 12, height = 10, units = "in", res = 150)
p_ebal_buff_by_rspo / p_ebal_buff_peat_by_rspo
dev.off()

# Dose-response curves for buffer zones

# Pooled dose-response curves (all buffer zones)
dose_resp_buff_def_all <- plot_dose_resp(DV = "forloss_prop_remain",
                                         dat = plants_buff_convalue,
                                         weights = out_weights_buff_all,
                                         dose_range = 1:25,
                                         binary = FALSE,
                                         gname = "Deforestation Rate (Buffer Zones)")

dose_resp_buff_any_def_all <- plot_dose_resp(DV = "any_def",
                                              dat = plants_buff_convalue,
                                              weights = out_weights_buff_all,
                                              dose_range = 1:25,
                                              binary = TRUE,
                                              gname = "Deforestation Probability (Buffer Zones)")

dose_resp_buff_peat_all <- plot_dose_resp(DV = "peatloss_prop_remain",
                                          dat = plants_buff_convalue_peat,
                                          weights = out_weights_buff_peat_all,
                                          dose_range = 1:25,
                                          binary = FALSE,
                                          gname = "Peat Deforestation Rate (Buffer Zones)")

dose_resp_buff_any_peat_all <- plot_dose_resp(DV = "any_peat",
                                               dat = plants_buff_convalue_peat,
                                               weights = out_weights_buff_peat_all,
                                               dose_range = 1:25,
                                               binary = TRUE,
                                               gname = "Peat Deforestation Probability (Buffer Zones)")

# Split by RSPO status of adjacent plantation (key research question)
dose_resp_buff_def_sep <- plot_dose_resp(DV = "forloss_prop_remain",
                                          dat = plants_buff_convalue,
                                          weights = out_weights_buff_all,
                                          dose_range = 1:25,
                                          binary = FALSE,
                                          gname = "Deforestation Rate (Buffer Zones)",
                                          separate = TRUE)

dose_resp_buff_any_def_sep <- plot_dose_resp(DV = "any_def",
                                              dat = plants_buff_convalue,
                                              weights = out_weights_buff_all,
                                              dose_range = 1:25,
                                              binary = TRUE,
                                              gname = "Deforestation Probability (Buffer Zones)",
                                              separate = TRUE)

dose_resp_buff_peat_sep <- plot_dose_resp(DV = "peatloss_prop_remain",
                                           dat = plants_buff_convalue_peat,
                                           weights = out_weights_buff_peat_all,
                                           dose_range = 1:25,
                                           binary = FALSE,
                                           gname = "Peat Deforestation Rate (Buffer Zones)",
                                           separate = TRUE)

dose_resp_buff_any_peat_sep <- plot_dose_resp(DV = "any_peat",
                                               dat = plants_buff_convalue_peat,
                                               weights = out_weights_buff_peat_all,
                                               dose_range = 1:25,
                                               binary = TRUE,
                                               gname = "Peat Deforestation Probability (Buffer Zones)",
                                               separate = TRUE)

# Export figures
png("dose_resp_buff_all.png", width = 10, height = 12, units = "in", res = 150)
dose_resp_buff_def_all / dose_resp_buff_any_def_all /
  dose_resp_buff_peat_all / dose_resp_buff_any_peat_all +
  plot_layout(axes = "collect")
dev.off()

png("dose_resp_buff_sep.png", width = 10, height = 12, units = "in", res = 150)
dose_resp_buff_def_sep / dose_resp_buff_any_def_sep /
  dose_resp_buff_peat_sep / dose_resp_buff_any_peat_sep +
  plot_layout(axes = "collect")
dev.off()

```